Abstract

In this paper, we present two classes of Bayesian approaches to the
two-sample problem. Our first class of methods extends the Bayesian t-test to
include all parametric models in the exponential family and their conjugate
priors. Our second class of methods uses Dirichlet process mixtures (DPM) of
such conjugate-exponential distributions as flexible nonparametric priors
over the unknown distributions.

1 Introduction 



In this paper, we tackle the so-called two-sample problem:

Problem Statement 1 Given two samples $X = \{x_1, \ldots, x_{m_1}\} \sim q_1$
and $Y = \{y_1, \ldots, y_{m_2}\} \sim q_2$ from two underlying distributions
$q_1$ and $q_2$, the two-sample problem is to decide whether $q_1 = q_2$.

An associated test is called a two-sample test. Such tests are encountered in
various disciplines from the life sciences to the social sciences:

• In medical studies, one may want to find out if two classes of patients
show different behaviour, response to a drug or susceptibility to a disease.

• In microarray analysis, one may compare measurements from different
weeks, labs or platforms to find out if they follow the same distribution,
before integrating them into one dataset, in order to increase sample size.

• In the neurosciences, one may want to compare measurements of brain
signals under different external stimuli, to check whether brain activity is
affected by these stimuli.

• In the social sciences, one may want to compare whether the behavior of a
group of people, e.g. when they graduate, marry, or die, is different across
countries or generations.

• In the financial sciences, one could for example compare the set of
transactions performed at a stock exchange during different weeks, to find out
if there is a change in activity in the financial markets.
While this question has been studied in detail by classic statistics for
univariate data, there is less work on multivariate data (which we review in
Section 2). The only machine learning approach to this problem is a kernel
method by (Gretton et al., 2007), which uses the distance between the means of
the two samples X and Y in a universal reproducing kernel Hilbert space as its
test statistic; it has created a lot of interest in the subject and follow-on
studies (Borgwardt et al., 2006; Huang et al., 2007; Gretton & Gyorfi, 2008).

Here, we approach this two-sample problem from a Bayesian perspective.
The classic Bayesian formulation of this problem would be in terms of a Bayes
factor (Kass & Raftery, 1995), which represents the likelihood ratio that the
data were generated according to hypothesis $H_0$ (that is, from the same
distribution) or hypothesis $H_1$ (that is, from different distributions).
However, how exactly to define these two hypotheses is a crucial question, and
many answers have been given in the Bayesian literature with hypotheses that
are tailored to a specific problem or application domain; one example is the
Bayesian t-tests used in microarray data analysis (Baldi & Long, 2001; Fox &
Dimmic, 2006). Our goal in this paper is to define two general classes of
two-sample tests that represent a precise formulation of the two-sample
problem, but are not tailored to a specific application. They are designed to
offer an attractive middle ground between the general idea of using Bayes
factors and the specialised hypothesis testing problems studied in the
literature.

In detail, we define a class of nonparametric Bayesian two-sample tests based
on Dirichlet process mixture models. The use of Dirichlet process mixtures 
for flexible nonparametric modelling of general unknown distributions has a 
long history in Statistics. However, although the two-sample problem depends 
crucially on testing whether data come from one or two unknown distributions, 
Bayesian approaches based on nonparametric density models have not been 
explored to date. Here we propose and explore such a non-parametric method 
using the classic Dirichlet process mixture. To the best of our knowledge, the 
only work that is remotely related is that on a Bayesian test for a parametric 
versus a nonparametric model of the data by Berger and Guglielmi (Berger & 
Guglielmi, 1998). This addresses a different but related question since it assumes 
a parametric null hypothesis. We also define a parametric Bayesian two-sample 
test where the model of the data is a member of the exponential family. This 
test generalizes the Bayesian t-test by (Baldi & Long, 2001) and (Fox & Dimmic, 
2006), who assume that the samples are Gaussian. 

This paper is structured as follows. In Section 2 we will review existing 
approaches to the two-sample problem on multivariate data, and highlight some 
differences between frequentist and Bayesian hypothesis testing. In Section 3 we 
outline the common core of our two Bayesian two-sample tests, before providing
the details on the parametric test in Section 4 and on the non-parametric test 
in Section 5. 
2 Multivariate two-sample tests



Related work in statistics and kernel machines Our method is a Bayesian
approach to a problem that has been studied in classic statistics and kernel
machine learning. Here we describe in short the prominent multivariate
two-sample tests (see also (Gretton et al., 2007)).

Frequentist two-sample tests follow the principle of classic hypothesis
testing: given the two samples X and Y, a test statistic is computed. Then
the distribution of this test statistic under the null hypothesis ($q_1 = q_2$)
is determined. If the value of the test statistic does not exceed the
$1 - \alpha$ quantile of the null distribution, the null hypothesis $q_1 = q_2$
is accepted at significance level $\alpha$. If its value exceeds the
$1 - \alpha$ quantile, it is rejected at significance level $\alpha$. So the
outcome of these tests depends on the significance level $\alpha$, which has
to be chosen a priori.
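
The frequentist recipe can be sketched in a few lines. The example below is purely illustrative and not one of the cited tests: the test statistic (absolute difference of the sample means), the use of random permutations of the pooled sample to approximate the null distribution, and all function names are assumptions made here for concreteness.

```python
import random

def mean_diff(x, y):
    """Test statistic: absolute difference of the two sample means."""
    return abs(sum(x) / len(x) - sum(y) / len(y))

def permutation_test(x, y, alpha=0.05, n_perm=2000, seed=0):
    """Return (reject_null, p_value) for the null hypothesis q1 = q2.

    The null distribution of the statistic is approximated by randomly
    re-splitting the pooled sample into two groups of the original sizes.
    """
    rng = random.Random(seed)
    observed = mean_diff(x, y)
    pooled = list(x) + list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mean_diff(pooled[:len(x)], pooled[len(x):]) >= observed:
            exceed += 1
    # Add-one correction so that the estimated p-value is never exactly zero.
    p_value = (exceed + 1) / (n_perm + 1)
    return p_value <= alpha, p_value

if __name__ == "__main__":
    x = [0.1 * i for i in range(20)]
    y = [10.0 + 0.1 * i for i in range(20)]
    print(permutation_test(x, y))
```

Note that the decision depends on the chosen `alpha`, which is exactly the dependence on the significance level discussed above.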

Frequentist tests differ mainly in two points: a) the test statistic they
employ and b) the way in which they determine the null distribution for this
test statistic. The classic multivariate t-test (Hotelling, 1951) assumes that
both distributions are multivariate Gaussian with unknown identical
covariance; Friedman and Rafsky (Friedman & Rafsky, 1979; Henze & Penrose,
1999) define test statistics based on spanning trees, namely the number of
edges that connect points from X to Y in a minimum spanning tree
(Wald-Wolfowitz test) and the closeness of points from X and Y in a ranking
derived from the minimum spanning tree (Kolmogorov-Smirnov test) (Bickel,
1969; Friedman & Rafsky, 1979). Rosenbaum's test statistic is the number of
pairs containing a data point from X and a data point from Y in a minimum
distance non-bipartite matching over $X \cup Y$. Hall and Tajvidi (Hall &
Tajvidi, 2002) essentially count, for each data point, the number of its
nearest neighbours in $X \cup Y$ that are from the other sample. Biau and
Gyorfi's statistic is the distance between Parzen window estimates of the
densities (Anderson et al., 1994; Biau & Gyorfi, 2005). Gretton et al. use
the distance between the means of X and Y in a universal reproducing kernel
Hilbert space as their test statistic (Gretton et al., 2007).
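
To make the last of these statistics concrete, the following sketch computes a biased empirical estimate of the squared distance between the kernel mean embeddings of two univariate samples. The Gaussian kernel, its bandwidth, the restriction to scalar data and the function names are assumptions chosen for illustration; the cited work should be consulted for the estimator and the null distribution actually used there.

```python
import math

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased estimate of the squared distance between the means of x and y
    in the reproducing kernel Hilbert space of the Gaussian kernel."""
    kxx = sum(gaussian_kernel(a, b, bandwidth) for a in x for b in x) / len(x) ** 2
    kyy = sum(gaussian_kernel(a, b, bandwidth) for a in y for b in y) / len(y) ** 2
    kxy = sum(gaussian_kernel(a, b, bandwidth) for a in x for b in y) / (len(x) * len(y))
    return kxx + kyy - 2.0 * kxy

if __name__ == "__main__":
    print(mmd2_biased([0.0, 0.1, 0.2], [5.0, 5.1, 5.2]))
```

The statistic is zero when both samples coincide and grows as the samples move apart; a frequentist test would still need a null distribution for it.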

Frequentist versus Bayesian approach In contrast to classic hypothesis
testing, the test statistic in Bayesian hypothesis testing is a so-called Bayes
factor. It is the ratio of the likelihoods of two opposing hypotheses having
generated the data $D = \{X, Y\}$, the hypothesis $H_0$ ($q_1 = q_2$) and its
alternative $H_1$ ($q_1 \neq q_2$).

To summarize, classic frequentist hypothesis testing considers only one
hypothesis and evidence against it, whereas Bayesian hypothesis testing
compares the likelihoods of two alternative hypotheses having generated the
data at hand. While the question of which perspective is to be preferred is
still an ongoing and unresolved debate, we deem it useful to have a Bayesian
alternative to the classic frequentist two-sample tests for the following
reasons: Bayes factors are easier to interpret than the commonly used
p-values, and prior knowledge on the probability of the two hypotheses can be
incorporated into the Bayes factor in a straightforward manner.
3 Concept of Bayesian two-sample tests



3.1 Bayes factor as test criterion 

Our two classes of Bayesian two-sample tests are based on the idea of
computing a Bayes factor between two alternative hypotheses: the hypothesis
$H_1$ that both samples were independently generated from different underlying
distributions $q_1$ and $q_2$ with $q_1 \neq q_2$, and the hypothesis $H_0$
that they originated from the same distribution $q$ ($q_1 = q_2$). This idea
is formalised in the following lemma.

Lemma 1 Given two samples $X \sim q_1$ and $Y \sim q_2$, we accept the
hypothesis $H_1$ that $q_1 \neq q_2$ if the Bayes factor

$$\chi = \frac{P(X, Y \mid H_1)}{P(X, Y \mid H_0)} > 1, \tag{1}$$

otherwise we accept the hypothesis $H_0$ that $q_1 = q_2 = q$.

The hypothesis $H_1$ is that the samples originate from different
distributions, such that

$$P(X, Y \mid H_1) = P(X \mid H_1)\, P(Y \mid H_1). \tag{2}$$
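
In practice the marginal probabilities involved are extremely small, so it is convenient to work with logarithms. The sketch below combines the factorisation (2) with the decision rule of Lemma 1. The function names are hypothetical, the threshold of 1 on the Bayes factor (0 on the log scale) is treated as an assumption, and the optional prior odds argument is an illustration of how prior knowledge about the hypotheses could enter.

```python
import math

def log_bayes_factor(log_p_x_h1, log_p_y_h1, log_p_xy_h0):
    """Log Bayes factor: log P(X|H1) + log P(Y|H1) - log P(X,Y|H0)."""
    return log_p_x_h1 + log_p_y_h1 - log_p_xy_h0

def accept_h1(log_bf, prior_odds_h1=1.0):
    """Accept H1 if the posterior odds in favour of H1 exceed 1.

    With prior_odds_h1 = 1 this reduces to thresholding the Bayes factor at 1.
    """
    return log_bf + math.log(prior_odds_h1) > 0.0

if __name__ == "__main__":
    log_bf = log_bayes_factor(-10.0, -12.0, -25.0)
    print(log_bf, accept_h1(log_bf))
```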

3.2 Computation of the Bayes factor 

The central challenge when computing the Bayes factor $\chi$ is that we do not
know the distributions $q_1$, $q_2$, and $q$ our Bayes factor is based upon.
Since $q$, $q_1$ and $q_2$ are unknown probability distributions, we have to
compute the integral over all such distributions with respect to some prior on
distributions. We offer two classes of solutions here. In Section 4, we
present a parametric test where the distributions are in the exponential
family and have conjugate priors. In Section 5, we present a non-parametric
test where the distributions $q_1$, $q_2$, and $q$ are assumed to be drawn
from a Dirichlet Process mixture model.

4 Parametric Bayesian two-sample test 

4.1 Exponential Families 

For the parametric Bayesian two-sample test, we assume that the underlying
distributions $q_1$ and $q_2$ are in the exponential family: The distribution
for models from this family can be written in the form

$$p(x \mid \theta) = f(x)\, g(\theta) \exp\{\theta^T u(x)\}, \tag{3}$$

where $u(x)$ is a K-dimensional vector of sufficient statistics, $\theta$ are
the natural parameters, and $f$ and $g$ are non-negative functions. The
conjugate prior is

$$p(\theta \mid \eta, \nu) = h(\eta, \nu)\, g(\theta)^{\eta} \exp\{\theta^T \nu\}, \tag{4}$$

where $\eta$ and $\nu$ are hyperparameters, and $h$ normalizes the distribution.
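
As a simple illustration of the form (3), consider the Bernoulli distribution with mean $\mu$, chosen here only as an example. Setting $\theta = \log(\mu / (1 - \mu))$, $u(x) = x$, $f(x) = 1$ and $g(\theta) = 1 / (1 + e^{\theta})$ recovers the usual probability mass function, which the following sketch checks numerically. The function names are hypothetical.

```python
import math

def bernoulli_pmf(x, mu):
    """Standard Bernoulli probability of x in {0, 1} with mean mu."""
    return mu if x == 1 else 1.0 - mu

def natural_parameter(mu):
    """Natural parameter theta = log(mu / (1 - mu)) of the Bernoulli model."""
    return math.log(mu / (1.0 - mu))

def expfam_pmf(x, theta):
    """Bernoulli written as f(x) g(theta) exp(theta * u(x)) with u(x) = x."""
    f = 1.0
    g = 1.0 / (1.0 + math.exp(theta))
    return f * g * math.exp(theta * x)

if __name__ == "__main__":
    theta = natural_parameter(0.3)
    for x in (0, 1):
        print(x, bernoulli_pmf(x, 0.3), expfam_pmf(x, theta))
```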
4.2 Bayes factor of parametric test

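The Bayes factor of the parametric two-sample test can then be computed as

$$\chi = \frac{P(X \mid \beta)\, P(Y \mid \beta)}{P(X, Y \mid \beta)} \tag{5}$$

$$= \frac{\int P(X \mid \theta) P(\theta \mid \beta)\, d\theta \int P(Y \mid \theta) P(\theta \mid \beta)\, d\theta}{\int P(X, Y \mid \theta) P(\theta \mid \beta)\, d\theta} \tag{6}$$

$$= \frac{h(\eta, \nu)\, h(\eta + m_1 + m_2,\ \nu + u(X) + u(Y))}{h(\eta + m_1,\ \nu + u(X))\, h(\eta + m_2,\ \nu + u(Y))}, \tag{7}$$

where

$$u(X) = \sum_{i=1}^{m_1} u(x_i), \quad u(Y) = \sum_{j=1}^{m_2} u(y_j), \tag{8}$$

and $\beta$ is the set of hyperparameters $\{\eta, \nu\}$ of the prior.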
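
The closed form in terms of the normaliser $h$ can be evaluated directly once a specific family is chosen. As a minimal sketch, assume binary data with a Bernoulli model: in the mean parametrisation the conjugate prior is then a Beta($\nu$, $\eta - \nu$) density, so that $h(\eta, \nu) = 1 / B(\nu, \eta - \nu)$ with $B$ the Beta function (this requires $0 < \nu < \eta$). The choice of family, the default hyperparameters (a uniform prior on the mean) and the function names are assumptions for illustration.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_h(eta, nu):
    """Log normaliser of the conjugate prior for the Bernoulli model,
    i.e. minus the log normalising constant of Beta(nu, eta - nu)."""
    return -log_beta(nu, eta - nu)

def log_bayes_factor(x, y, eta=2.0, nu=1.0):
    """Log of the parametric Bayes factor for two binary samples x and y."""
    m1, m2 = len(x), len(y)
    ux, uy = sum(x), sum(y)  # summed sufficient statistics u(X), u(Y)
    return (log_h(eta, nu) + log_h(eta + m1 + m2, nu + ux + uy)
            - log_h(eta + m1, nu + ux) - log_h(eta + m2, nu + uy))

if __name__ == "__main__":
    print(log_bayes_factor([1] * 10, [0] * 10))       # clearly different samples
    print(log_bayes_factor([1, 0] * 5, [1, 0] * 5))   # identical samples
```

A positive log Bayes factor favours different distributions, a negative one favours a common distribution; only the sample sizes and the summed sufficient statistics enter the computation.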

5 Nonparametric Bayesian two-sample test 

Unlike its parametric counterpart, our nonparametric Bayesian two-sample test
does not employ one single model for the data, but rather the limit of
infinitely many components of a finite mixture model:
$P(X \mid \alpha, \beta) = \int P(X \mid q)\, P(q \mid \alpha, \beta)\, dq$,
where $q$ is an unknown distribution, modelled as an infinite mixture, and
$\alpha$ and $\beta$ are hyperparameters controlling it.

This can be achieved via a Dirichlet process mixture of members of the
exponential family. The Bayes factor for the nonparametric two-sample test
equals

$$\chi = \frac{P(X \mid \alpha, \beta)\, P(Y \mid \alpha, \beta)}{P(X, Y \mid \alpha, \beta)}, \tag{9}$$

where $P(X \mid \alpha, \beta)$ is the marginal probability of sample X under
a Dirichlet Process Mixture Model (analogous definitions for
$P(Y \mid \alpha, \beta)$ and $P(X, Y \mid \alpha, \beta)$) with concentration
parameter $\alpha$ and base measure hyperparameter $\beta$.

5.1 Dirichlet Process Mixture Models 

A key component in our nonparametric two-sample test is the ability to
approximately infer the marginal probability of a set of observations from a
Dirichlet Process Mixture Model (DPM). As these DPMs are at the heart of our
nonparametric two-sample test, let us review them here (Ferguson, 1973;
Antoniak, 1974).

A Dirichlet Process (DP), and also a Dirichlet Process Mixture Model (DPM),
is a probability distribution on probability distributions, and DPMs consider
the limit of infinitely many components of a finite mixture model. By allowing
for an infinite number of components, we are able to model the complicated
distributions that we encounter in real-world applications via DPMs.
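Consider a finite mixture model with C components

$$p(x^{(i)} \mid \phi) = \sum_{j=1}^{C} p(x^{(i)} \mid \theta_j)\, p(c_i = j \mid \zeta), \tag{10}$$

where $c_i \in \{1, \ldots, C\}$ is a cluster indicator variable for data point
i, $\zeta$ are the parameters of a multinomial distribution with

$$p(c_i = j \mid \zeta) = \zeta_j, \tag{11}$$

$\theta_j$ are the parameters of the jth component, and

$$\phi = (\theta_1, \ldots, \theta_C, \zeta). \tag{12}$$

Let the parameters of each component have conjugate priors
$p(\theta \mid \beta)$ as before, and the multinomial parameters also have a
conjugate Dirichlet prior

$$p(\zeta \mid \alpha) = \frac{\Gamma(\alpha)}{\Gamma(\alpha / C)^{C}} \prod_{j=1}^{C} \zeta_j^{\alpha / C - 1}. \tag{13}$$

Given a data set $\mathcal{D} = \{x^{(1)}, \ldots, x^{(n)}\}$, the marginal
likelihood for this mixture model is

$$p(\mathcal{D} \mid \alpha, \beta) = \int \left[ \prod_{i=1}^{n} p(x^{(i)} \mid \phi) \right] p(\phi \mid \alpha, \beta)\, d\phi, \tag{14}$$

where

$$p(\phi \mid \alpha, \beta) = p(\zeta \mid \alpha) \prod_{j=1}^{C} p(\theta_j \mid \beta). \tag{15}$$

This marginal likelihood can be rewritten as

$$p(\mathcal{D} \mid \alpha, \beta) = \sum_{c} p(c \mid \alpha)\, p(\mathcal{D} \mid c, \beta), \tag{16}$$

where $c = (c_1, \ldots, c_n)$ and

$$p(c \mid \alpha) = \int p(c \mid \zeta)\, p(\zeta \mid \alpha)\, d\zeta \tag{17}$$

is a standard Dirichlet integral. The quantity (16) is well-defined even in the
limit $C \to \infty$. Although the number of possible settings of $c$ grows as
$C^n$ and therefore diverges as $C \to \infty$, the number of possible ways of
partitioning the n points remains finite (roughly $O(n^n)$). Using
$\mathcal{V}$ to denote the set of all possible partitionings of n data points,
we can re-write (16) as

$$p(\mathcal{D} \mid \alpha, \beta) = \sum_{v \in \mathcal{V}} p(v \mid \alpha)\, p(\mathcal{D} \mid v, \beta). \tag{18}$$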
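
For very small n the sum over partitions can be written out explicitly. The sketch below enumerates all partitions of n items and evaluates the partition prior, for which we assume the standard closed form of the Dirichlet process in the limit of infinitely many components, $p(v \mid \alpha) = \alpha^{K} \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_{k=1}^{K} \Gamma(n_k)$, where K is the number of blocks and $n_k$ their sizes. This formula is not stated in the text above and is introduced here as an assumption; the function names are hypothetical.

```python
import math

def partitions(items):
    """Yield every partition of a list (by position) as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Place `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or open a new block containing only `first`.
        yield [[first]] + part

def log_partition_prior(part, alpha):
    """Log prior probability of a partition under a DP with concentration alpha."""
    n = sum(len(block) for block in part)
    return (len(part) * math.log(alpha)
            + math.lgamma(alpha) - math.lgamma(alpha + n)
            + sum(math.lgamma(len(block)) for block in part))

if __name__ == "__main__":
    all_parts = list(partitions([0, 1, 2, 3]))
    total = sum(math.exp(log_partition_prior(p, 1.5)) for p in all_parts)
    print(len(all_parts), total)
```

Multiplying each prior term by the product of per-block marginal likelihoods and summing gives the marginal likelihood of the data; the prior terms alone sum to one over all partitions.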
5.2 Approximate inference of marginal probabilities under DPM



While finite, the number of partitions still grows as $O(n^n)$ with the size n
of the dataset, rendering an exact inference of the marginal probabilities
under a DPM intractable even for moderate size datasets (roughly n > 10).
Hence we have to resort to approximate inference methods for computing these
marginals. One choice is Bayesian hierarchical clustering (BHC), a clustering
algorithm that can be used for approximate inference of marginal probabilities
under a DPM in $O(n^2)$ (Heller & Ghahramani, 2005).
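
To get a feeling for this growth: the exact number of partitions of n points is the Bell number $B_n$, a standard combinatorial fact that is not stated in the text above. The following sketch computes Bell numbers with the Bell triangle; the function name is hypothetical.

```python
def bell_numbers(n_max):
    """Return the list [B_0, ..., B_n_max] computed with the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry above-left in the triangle.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells

if __name__ == "__main__":
    for n, b in enumerate(bell_numbers(20)):
        print(n, b)
```

Already at n = 10 there are 115975 partitions, and the count keeps growing faster than exponentially, which is why exhaustive summation is only an option for very small samples.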



5.3 Bayes factor in nonparametric test

For the nonparametric two-sample test we use a DPM as the distribution on
distributions $q$, $q_1$, $q_2$. This allows us to integrate out the
parameters of the unknown underlying probability distributions $q$, $q_1$ and
$q_2$ in a Bayesian manner, while employing a flexible model for these
distributions.
The Bayes factor $\chi$ from (1) can then be computed as

$$\chi = \frac{\int P(X \mid q_1) P(q_1 \mid \alpha, \beta)\, dq_1 \int P(Y \mid q_2) P(q_2 \mid \alpha, \beta)\, dq_2}{\int P(X, Y \mid q) P(q \mid \alpha, \beta)\, dq} \tag{19}$$

$$= \frac{P(X \mid \alpha, \beta)\, P(Y \mid \alpha, \beta)}{P(X, Y \mid \alpha, \beta)}, \tag{20}$$

where $P(X \mid q_1) = \prod_{i=1}^{m_1} q_1(x_i)$ and
$P(q_1 \mid \alpha, \beta)$ is a Dirichlet process mixture with concentration
parameter $\alpha$ and base measure hyperparameter $\beta$. Hence
$P(X, Y \mid \alpha, \beta)$ denotes the marginal probability that X and Y
were generated from this DPM with hyperparameters $\alpha$ and $\beta$
(analogous for $P(X \mid \alpha, \beta)$ and $P(Y \mid \alpha, \beta)$).
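
For tiny samples the nonparametric Bayes factor can be computed exactly by summing over all partitions, which makes the structure of the test easy to inspect. The sketch below does this under several assumptions that are not part of the text above: exhaustive enumeration replaces the approximate inference by BHC, the mixture components are univariate Gaussians with known variance `sigma2` and a conjugate Gaussian prior (mean `mu0`, variance `tau2`) on the component mean, and the partition prior takes the standard Dirichlet process form $\alpha^{K} \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_k \Gamma(n_k)$. All names and default values are hypothetical.

```python
import math

def partitions(items):
    """Yield every partition of a list (by position) as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def log_cluster_marginal(block, sigma2=1.0, mu0=0.0, tau2=4.0):
    """Log marginal likelihood of one cluster with the component mean
    integrated out: points ~ N(mu, sigma2), mu ~ N(mu0, tau2)."""
    n = len(block)
    d = [v - mu0 for v in block]
    s1 = sum(d)
    s2 = sum(v * v for v in d)
    log_det = (n - 1) * math.log(sigma2) + math.log(sigma2 + n * tau2)
    quad = (s2 - tau2 / (sigma2 + n * tau2) * s1 * s1) / sigma2
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)

def log_dpm_marginal(data, alpha=1.0):
    """Exact log marginal probability of the data under the DPM, obtained by
    summing over all partitions (feasible only for very small samples)."""
    n = len(data)
    log_norm = math.lgamma(alpha) - math.lgamma(alpha + n)
    terms = []
    for part in partitions(list(data)):
        log_prior = log_norm + sum(math.log(alpha) + math.lgamma(len(b)) for b in part)
        terms.append(log_prior + sum(log_cluster_marginal(b) for b in part))
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def log_bayes_factor(x, y, alpha=1.0):
    """Log of P(X) P(Y) / P(X, Y), each marginal taken under the DPM."""
    return (log_dpm_marginal(x, alpha) + log_dpm_marginal(y, alpha)
            - log_dpm_marginal(list(x) + list(y), alpha))

if __name__ == "__main__":
    print(log_bayes_factor([-3.1, -2.9, -3.0], [3.0, 3.2, 2.8]))
    print(log_bayes_factor([0.1, -0.2, 0.0], [0.05, -0.1, 0.15]))
```

With well separated samples the log Bayes factor comes out positive, and with overlapping samples negative, in line with the decision rule of Lemma 1.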



6 Discussion and Conclusions 

In this paper, we have proposed two classes of Bayesian two sample tests, a 
parametric test based on distributions from the exponential family, and a non- 
parametric test based on Dirichlet Process Mixture Models. 

An issue of future work will be the runtime of two-sample tests. Frequentist
tests are often expensive to compute, as the test statistic often requires at
least a runtime of $O(n^2)$ for n datapoints and bootstrapping for determining
the null distribution. The Bayesian tests avoid this bootstrapping step and
there exist various approximations to a Dirichlet process mixture model (Blei &
Jordan, 2005; Kurihara et al., 2006; Kurihara et al., 2007), some of which can
be computed in less than $O(n^2)$. Hence the Bayesian approach might hold the
key for efficient two-sample tests, which we will look at in future work.